Abstract 

Background: The human body plays host to a vast array of bacteria, found in the oral cavity, skin, gastrointestinal 
tract and vagina. Some bacteria are harmful while others are beneficial to the host. Although many methods are 
available to identify bacteria, most of them are tedious and applicable only to specific, cultivable bacteria. 
Based on high-throughput sequencing technology, this work derives 16S rRNA sequences of bacteria and 
analyzes probiotic and pathogen species. 

Results: We constructed a database that records the species of probiotics and pathogens from the literature, along 
with a modified Smith-Waterman algorithm for assigning taxonomy to the sequenced 16S rRNA reads. We 
also constructed a bacterial disease risk model for seven diseases based on 98 samples. The applicability of the proposed 
platform is demonstrated by analyzing the human gut microbiome of 13 samples. 

Conclusions: The proposed platform provides a relatively easy means of identifying a range of bacteria 
and their species (including uncultivable pathogens) for clinical microbiology applications. Detecting how 
probiotics and pathogens inhabit humans and how they affect health can contribute significantly to the development of 
diagnosis and treatment methods. 



Background 

High-throughput sequencing can analyze a large number 
of sequences, enabling 16S rRNA sequencing to identify 
the bacterial species, both pathogens and probiotic bacteria, 
in a complex community. Many naturally occurring bacteria form complex 
populations in the environment. The human body plays 
host to a vast array of bacteria, found in the oral cavity, skin, 
gastrointestinal tract and vagina. Some bacteria are 
harmful while others are beneficial to the host. 

A pathogen is a microorganism that causes disease in its 
host. For example, bacterial pathogens include Corynebacterium 
diphtheriae (causes diphtheria), Listeria monocytogenes 
(causes food poisoning), and Legionella pneumophila (causes 
Legionnaires' disease). Probiotics, another group of microorganisms, 
benefit the host and have received considerable attention in 
recent years. A FAO report in 2001 [1] cited the advantages 
of probiotics as increasing immunity [2,3], reducing 
gastrointestinal discomfort [4,5], and protecting the flora 
within the urogenital tract [6]. Probiotics have also been reported 
to ameliorate symptoms of diseases [7] and reduce the 
risk of suffering from diseases [8,9]. 

Despite the availability of many approaches to identify 
probiotics and pathogens, most of them are applicable only 
to specific, cultivable bacteria and are time-consuming. 
For instance, conventional methods detect growth 
of cultured bacteria in approximately two days, and need an 
additional five days to obtain no-growth culture results 
[10], which is laborious. Besides, some bacteria cannot 
be cultured [11], which increases the difficulty of 
specifying pathogenic bacteria. Moreover, it is hard to 
determine whether an infection is caused by one or 
more bacterial types. 

16S rRNA sequences, which identify bacteria 
at the molecular level, can detect uncultivable bacteria 
[12]. 16S rRNA sequencing can therefore overcome some 
problems of the conventional culture method [13]. Although 
16S rRNA sequencing is a more effective means of 
identifying bacteria than the conventional culture method, 
it takes a considerable amount of 
time to amplify DNA sequences [14]. Sanger sequencing, 
known as "first-generation" or "conventional" sequencing, 
has been used for DNA sequencing for almost two 
decades. Next generation sequencing (NGS) can analyze 
large-scale sequences more quickly, enable massively parallel 
analysis, reduce reagent costs and the size of sample components, 
and achieve high throughput [15]. NGS is thus 
more efficient than the Sanger method, which generates 
one read per sample. In addition, NGS of 16S rRNA more 
easily identifies both cultivable and uncultivable bacteria [12]. 

Owing to improvements in sequencing technology 
and bioinformatics approaches, the accuracy of distinguishing 
bacteria with these methods has increased. 
Based on high-throughput sequencing technology, this 
work identifies 16S rRNA sequences of bacteria and analyzes 
bacterial species. High-throughput sequencing can 
sequence a large number of 16S rRNA sequences 
efficiently, giving researchers 
the information needed to identify pathogens and 
probiotic bacteria [16-18]. 

Results 

Platform application: gut probiotics and pathogens 
detection 

The read statistics of quality filtering and taxonomy 
assignment are shown in Table 1. Figure 1A 
illustrates the percentage of probiotics detected by 
the proposed platform. Table 2 lists the quantities 
(matched sequenced reads) of probiotics identified in 
the samples in the case study. The top three identified 
probiotics in the 12 samples are Lactobacillus salivarius, 
Streptococcus thermophilus, and Bifidobacterium 
longum. Figure 1B and Table 3 give the proportions 
and quantities of pathogens, of which the top three 
are Escherichia coli, Salmonella enterica, and Haemophilus 
influenzae. 

Table 4 lists the results of the disease risk evaluations. It 
shows that two samples (B031 and B034) reached the 
significance level against the control group for three diseases: 
obesity, colorectal cancer, and constipation. Using a binomial 
test against the 98-sample control group, sample B031 reached 
the significance level in constipation (P = 0.0333) and 
colorectal cancer (P = 0.0121), both below 0.05. 
Sample B034 reached the significance level in 
obesity (P = 0.00257) and colorectal cancer (P = 0.0121). 
Evaluated by the association between bacterial risk markers 
and disease, the results suggest that these two samples carry 
a higher risk of constipation, colorectal cancer, and obesity 
than the 98-sample control group. Their enterotypes of gut probiotics 
and pathogens may be one of the risk factors that 
cause disease. 
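
The binomial test mentioned above can be illustrated with a minimal sketch. It computes an exact upper-tail binomial probability and compares it with the 0.05 level. The function names, the marker count, and the per-marker null probability are assumptions made for illustration only; this section does not state how the platform parameterizes its test, so the numbers are not expected to reproduce Table 4.

```python
from math import comb

ALPHA = 0.05  # significance level quoted in the text


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_significant(k, n, p, alpha=ALPHA):
    """True when observing at least k hits out of n is unlikely under the null."""
    return binomial_upper_tail(k, n, p) < alpha


if __name__ == "__main__":
    # Hypothetical setting: a disease with 6 markers, each assumed to fall on
    # the "risk" side with probability 0.05 in the control group.
    n_markers, p_null = 6, 0.05
    for hits in range(4):
        p_value = binomial_upper_tail(hits, n_markers, p_null)
        print(hits, f"{p_value:.4g}", is_significant(hits, n_markers, p_null))
```

An exact tail sum is used here rather than a normal approximation because the number of markers per disease is small.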

Reproducibility and accuracy evaluation of proposed 
platform 

Two replicate experiments on four samples were performed 
to estimate the reproducibility of the proposed 
platform. The results of the repeated experiments were consistent. 
The similarity between the two repeated experiments 
was calculated by using UniFrac [19]. Results of each 



Table 1 Results of quality filtering and taxonomy assignment 

ID     Raw reads  QC reads  QC %    Bacteria identified  Identified %  Probiotics  Probiotics %  Pathogens  Pathogens %
B011   125420     117451    93.65%  90952                77.44%        60          0.07%         3509       3.86%
B012   132240     120134    90.85%  94679                78.81%        3457        3.65%         20109      21.24%
B013   151876     142585    93.88%  99025                69.45%        3452        3.49%         21341      21.55%
B014   134619     126784    94.18%  95377                75.23%        611         0.64%         6665       6.99%
B016   135457     126507    93.39%  89407                70.67%        49          0.05%         20870      23.34%
B017   141682     131968    93.14%  89465                67.79%        1064        1.19%         8944       10.00%
B018   111228     102382    92.05%  56981                55.66%        910         1.60%         11630      20.41%
B019   128532     120719    93.92%  76877                63.68%        305         0.40%         2775       3.61%
B020   128441     121957    94.95%  89618                73.48%        123         0.14%         3673       4.10%
B031   140941     132311    93.88%  97962                74.04%        2129        2.17%         5194       5.30%
B033   142462     134554    94.45%  80548                59.86%        229         0.28%         2725       3.38%
B034   148854     140059    94.09%  106050               75.72%        9857        9.29%         15436      14.56%
Total  1621752    1517411   93.54%  1066941              70.31%        22246       2.09%         122871     11.52%
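
The percentages in Table 1 can be re-derived from the read counts, as the short sketch below shows for sample B011. The choice of denominators (QC reads over raw reads, identified reads over QC reads, and probiotic or pathogen reads over identified reads) is inferred from the numbers themselves and is an assumption rather than something the text states. The last lines check the per-sample average of QC-passed reads against the figure of 126,451 quoted in the Discussion.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals as in Table 1."""
    return round(100.0 * part / whole, 2)


# Read counts for sample B011, copied from Table 1.
raw, qc, identified, probiotics, pathogens = 125420, 117451, 90952, 60, 3509

qc_pct = pct(qc, raw)                      # share of raw reads passing QC
identified_pct = pct(identified, qc)       # share of QC reads assigned to bacteria
probiotics_pct = pct(probiotics, identified)
pathogens_pct = pct(pathogens, identified)

# Average QC-passed reads per sample over the 12 samples (Total row of Table 1).
total_qc_reads, n_samples = 1517411, 12
mean_qc_reads = round(total_qc_reads / n_samples)

if __name__ == "__main__":
    print(qc_pct, identified_pct, probiotics_pct, pathogens_pct, mean_qc_reads)
```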
Figure 1 Relative abundance of probiotics and pathogenic bacteria from human gut of all samples. (A) Percentage of probiotics 
identified in the samples. (B) Proportion of pathogenic bacteria identified in the samples in the case study. 



Table 2 The quantities (matched sequenced reads) of probiotics identified in the samples in the case study 

Probiotics                    B011   B012   B013   B014   B016   B017   B018   B019   B020   B031   B033   B034   Total
Bacillus coagulans            0      81     6      1      2      0      3      0      1      0      9      1      104
Bifidobacterium adolescentis  4      3      1520   81     1      372    185    177    1      5      0      375    2724
Bifidobacterium animalis      0      101    37     3      1      16     32     1      1      0      0      50     242
Bifidobacterium bifidum       0      3      3      0      0      84     2      0      0      0      0      21     113
Bifidobacterium breve         0      1092   465    96     6      102    212    13     2      9      18     79     2094
*Bifidobacterium longum       3      1859   1092   198    27     256    439    34     5      15     55     238    4221
Lactobacillus brevis          0      0      0      0      0      0      1      0      0      0      0      10     11
Lactobacillus casei           0      10     1      1      0      0      0      0      0      1      0      0      13
Lactobacillus fermentum       0      0      0      0      1      0      0      4      0      1      0      28     34
Lactobacillus gasseri         0      0      0      1      1      0      0      0      0      0      0      77     79
Lactobacillus johnsonii       0      0      0      0      0      0      0      0      0      0      0      7      7
Lactobacillus paracasei       0      1      2      0      0      0      0      0      0      0      0      1      4
Lactobacillus plantarum       1      0      0      2      0      0      0      0      0      0      0      0      3
Lactobacillus reuteri         0      0      0      1      0      0      0      0      0      0      0      1      2
Lactobacillus rhamnosus       0      1      0      0      0      0      0      0      0      0      0      2      3
*Lactobacillus salivarius     2      1      2      8      1      5      3      1      3      11     2      6753   6792
Lactococcus lactis            2      0      0      6      1      0      2      0      0      16     1      10     38
*Streptococcus thermophilus   48     305    324    213    8      229    31     75     110    2071   144    2204   5762
Total                         60     3457   3452   611    49     1064   910    305    123    2129   229    9857





For each species, if the number of reads is 0 for all samples, that species was not shown. 

*The leading three probiotics are Lactobacillus salivarius, Streptococcus thermophilus and Bifidobacterium longum. 
Table 3 The quantities (matched sequenced reads) of pathogens identified in the samples in the case study 

Pathogens                     B011   B012   B013   B014   B016   B017   B018   B019   B020   B031   B033   B034   Total
Bordetella pertussis          0      1      0      0      0      1      0      0      0      0      0      0      2
Brucella abortus              0      0      0      0      0      0      0      0      0      0      0      0      0
Brucella melitensis           0      0      0      0      0      0      0      0      0      0      0      0      0
Campylobacter jejuni          0      0      0      11     0      0      0      0      0      11     1      40     63
Clostridium botulinum         0      38     5048   4      5      2      1      361    153    1211   115    59     6997
Clostridium difficile         0      0      1      0      0      0      0      0      0      0      0      0      1
Clostridium perfringens       0      1      2      0      0      0      0      3      10     24     12     93     145
Corynebacterium diphtheriae   0      1      0      1      0      0      0      0      0      1      0      0      3
Enterococcus faecalis         57     13     1      4      8      4      0      20     5      19     6      38     175
Enterococcus faecium          41     8      2      6      5      2      1      22     3      13     1      32     136
*Escherichia coli             1744   8560   7900   3637   10651  4404   5691   1424   1733   210    165    4483   50602
*Haemophilus influenzae       2      1771   2      1055   8      49     1      171    15     1802   2322   4502   11700
Neisseria meningitidis        0      2      0      3      1      0      1      1      1      1      1      1      12
Pseudomonas aeruginosa        1      6      6      4      2      2      3      1      3      0      0      3      31
*Salmonella enterica          1570   9291   7978   1849   9864   4209   5726   622    1658   303    44     4495   47609
Shigella sonnei               41     243    239    32     308    122    192    8      41     1      1      98     1326
Staphylococcus aureus         0      0      0      0      0      0      0      0      1      0      0      0      1
Staphylococcus epidermidis    0      0      0      0      0      0      0      1      1      0      0      0      2
Streptococcus agalactiae      0      69     3      0      5      0      1      0      0      3      6      5      92
Streptococcus pneumoniae      46     26     9      16     1      36     3      46     25     272    5      154    639
Streptococcus pyogenes        7      76     149    14     10     112    9      94     23     417    45     1428   2384
Vibrio cholerae               0      3      0      0      0      1      1      0      1      0      1      0      7
Yersinia pestis               0      0      1      29     2      0      0      1      0      906    0      5      944
Total                         3509   20109  21341  6665   20870  8944   11630  2775   3673   5194   2725   15436





For each species, if the number of reads is 0 for all samples, that species was not shown. 

*The leading three pathogens are Escherichia coli, Salmonella enterica and Haemophilus influenzae. 



sample pair (replicates 1 and 2) closely resemble each 
other. The UniFrac-based similarity of each sample 
pair is higher than 0.96 (0.9617 for B014, 0.9872 for 
B018, 0.9914 for B020, 0.9722 for B033). This implies 
that the analysis results are reproducible. 

Next, the accuracy of the platform was evaluated by adding 
Lactobacillus reuteri to a stool sample (B050). Sample 
B050 contains 24,408 assigned taxa, with no detected 
counts of Lactobacillus reuteri. The test is whether the counts of 
this species are elevated in the positive control sample (B050S_L). 
Analysis results indicate 
that 27,113 taxa are detected in sample B050S_L, of which 
1,430 are assigned to Lactobacillus reuteri; the percentage of 
Lactobacillus reuteri thus markedly increases from 0% to about 5%. 
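
The spike-in result can be checked with a few lines of arithmetic. The sketch below only restates the counts given above; the helper name `percent` is illustrative.

```python
def percent(count, total):
    """Share of `count` in `total`, in percent."""
    return 100.0 * count / total


# Counts reported for the unspiked sample (B050) and the spiked sample (B050S_L).
before = percent(0, 24408)
after = percent(1430, 27113)

if __name__ == "__main__":
    print(f"L. reuteri before spike-in: {before:.2f}%")
    print(f"L. reuteri after spike-in:  {after:.2f}%")
```

The computed share after spiking is slightly above 5%, in line with the rounded value quoted in the text.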

In short, the proposed platform is accurate and reproducible in 
detecting the quantities of bacterial species. 
These results support the accuracy 
and feasibility of the proposed platform for identifying 
probiotics and pathogens. Requiring only about 
one day for detection and not limited to identifying certain 
bacteria, the proposed platform can detect and quantify 
multiple bacteria simultaneously. 

Discussion 

Because of cost constraints and technical limitations, 
the 16S rRNA sequences available in most databases 
are partial sequences. Many studies thus assign taxonomy 
by using partial 16S rRNA sequences. Our 
probiotics and pathogens 16S rRNA sequence database contains 
17,964 sequences collected from the NCBI nucleotide 
database, the NCBI 16S microbial rRNA database, the Greengenes 
database, and SILVA. Less than 39% of the 16S rRNA sequences 
in the database are longer than 1400 bp, and only 
9% of the sequences are close to full length. 

This work extracts the V4 region from full-length 16S 
rRNA of the human gut microbiome as a platform 
application. Some 16S rRNA variable regions, such as V3 and V4, 
are more dependable than others for assigning taxonomy 
[20,21]; in addition, some 16S rRNA 
variable regions are highly conserved. The proportion 



Table 4 The result of disease risk evaluations of 12 samples 

Disease             B011      B012      B013      B014      B016      B017      B018      B019      B020      B031       B033      B034
Constipation        2.67E-01  2.67E-01  2.67E-01  1.00E+00  1.00E+00  2.67E-01  1.00E+00  1.00E+00  2.67E-01  3.34E-02*  2.67E-01  1.00E+00
Obesity             1.34E-01  1.34E-01  1.00E+00  1.34E-01  1.00E+00  1.34E-01  1.34E-01  1.00E+00  1.00E+00  1.34E-01   1.00E+00  2.57E-03*
IBS                 3.33E-01  7.06E-01  1.00E+00  3.33E-01  1.00E+00  3.33E-01  3.33E-01  1.00E+00  1.10E-01  1.10E-01   7.06E-01  3.33E-01
Ulcerative colitis  9.30E-02  4.15E-01  1.00E+00  4.15E-01  1.00E+00  9.30E-02  4.15E-01  1.00E+00  9.30E-02  1.00E+00   1.00E+00  1.00E+00
Colorectal cancer   4.88E-01  2.59E-01  9.35E-01  7.47E-01  7.47E-01  7.47E-01  2.59E-01  4.88E-01  4.88E-01  1.22E-02*  7.47E-01  1.22E-02*
Atopic dermatitis   1.83E-01  1.00E+00  1.00E+00  1.00E+00  1.00E+00  1.83E-01  1.83E-01  1.00E+00  1.00E+00  1.00E+00   1.00E+00  1.83E-01
Allergic rhinitis   1.89E-01  1.00E+00  1.00E+00  1.00E+00  1.00E+00  1.89E-01  1.89E-01  1.89E-01  1.00E+00  1.89E-01   1.00E+00  1.89E-01

*Values marked with an asterisk (bold in the original table) are those of the two samples that reached the significance level (P-value less than 0.05) in three diseases compared to the 98-sample control group using the evaluation model. 
and diversity of probiotics and pathogens may differ 
depending on which 16S rRNA variable region is used. 
The proposed platform is also applicable to other 16S 
rRNA variable regions for taxonomy assignment. Importantly, 
the region selected should be the one 
that produces an outcome closest to that of the full-length 
16S rRNA sequence. 

This work further attempts to collect common probiotics 
and pathogens from the literature. Although the list may be 
incomplete, recent advances in sequencing technology 
make it possible to identify and define an increasing 
number of bacteria, implying an increase in the 
number of identified probiotics and pathogens in the 
future. Efforts are underway in our laboratory to update 
the list of probiotics and pathogens used. 

Previous studies [22-24] identified pathogenic or probiotic 
bacteria by using antibodies, 16S rRNA gene microarrays, 
fluorescence in situ hybridization (FISH), and 
proteomic methods. In this work, the proposed platform 
can detect various pathogens and probiotics based on 
16S rRNA (rDNA) sequences of bacteria using NGS and 
bioinformatics methods. An average of 126,451 reads was 
acquired per sample in this work. It is doubtful whether this 
sequencing depth is enough to detect small amounts of 
probiotics and pathogens. Although increasing the sequencing 
coverage can improve the sensitivity of detecting 
probiotics and pathogens, it also increases the sequencing 
cost. It is therefore important to work out an appropriate 
sequencing coverage for detecting probiotics and pathogens. 

The results of the disease risk evaluations revealed that 
most of the 12 samples did not have distributions 
of bacterial markers resembling those of the control group. Only two 
samples reached the significance level. 
The reason for this phenomenon may be the overlap of 
bacterial markers between diseases. Twenty-eight markers are used 
for colorectal cancer (CC), and 17 markers are used for irritable 
bowel syndrome (IBS); six markers overlap. For sample 
B031, the significant distribution in colorectal cancer 
partly contributed to the significance in irritable 
bowel syndrome owing to the overlapping markers. Similarly, 
for sample B034, two markers overlap between 
colorectal cancer and obesity. Under this speculation, 
the influence of colorectal cancer on irritable bowel syndrome 
would be six (overlapping markers of CC and IBS) 
over seventeen (markers of IBS), and the influence of 
colorectal cancer on obesity would be two (overlapping 
markers of CC and obesity) over nine (markers of obesity). 
In addition, the influence of colorectal cancer on 
constipation and ulcerative colitis would be one over six 
and two over ten, respectively. 
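
The "overlapping markers over markers of the target disease" ratio described above can be sketched with plain set operations. The marker names below are read from Table 5; the colorectal cancer set lists the 27 names that are legible in the source (one of its 28 names is not), and the function name `influence` is an illustrative assumption, not a term used by the platform.

```python
def influence(source_markers, target_markers):
    """Return (shared markers, markers of the target disease)."""
    shared = set(source_markers) & set(target_markers)
    return len(shared), len(set(target_markers))


CONSTIPATION = {
    "Escherichia coli", "Roseburia", "Lactobacillus", "Bifidobacterium",
    "Enterobacteriaceae", "Ruminococcus bromii",
}
OBESITY = {
    "Prevotella", "Bifidobacterium", "Lachnospiraceae", "Verrucomicrobiae",
    "Akkermansia", "Faecalibacterium prausnitzii", "Lactobacillus",
    "Coriobacteriaceae", "Erysipelotrichaceae",
}
# Legible colorectal cancer (CC) markers; one name is illegible and omitted.
COLORECTAL_CANCER = {
    "Bacteroides uniformis", "Roseburia", "Fusobacterium", "Eubacterium",
    "Coprococcus", "Collinsella aerofaciens", "Alistipes", "Sutterellaceae",
    "Escherichia", "Shigella", "Bacteroides fragilis", "Porphyromonas",
    "Faecalibacterium prausnitzii", "Ruminococcus albus", "Streptococcus",
    "Blautia hansenii", "Enterococcus", "Blautia producta",
    "Ruminococcus gnavus", "Eubacterium eligens", "Eubacterium rectale",
    "Bacteroides stercoris", "Enterobacteriales", "Erysipelotrichaceae",
    "Dorea", "Bifidobacterium longum", "Faecalibacterium",
}

if __name__ == "__main__":
    print("CC on constipation:", influence(COLORECTAL_CANCER, CONSTIPATION))
    print("CC on obesity:     ", influence(COLORECTAL_CANCER, OBESITY))
```

Markers are matched by exact name, so a genus-level marker such as Escherichia does not count as overlapping with the species-level marker Escherichia coli.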

In addition, some species-level markers also belong to 
genus-level markers, so genus and species 
markers may mutually affect the distributions. 
Collecting more markers and 
evaluating the distributions with markers at the same 
taxonomic level are therefore required to construct a global prediction 
model for the Taiwanese population. 

Conclusions 

This work constructed a bacterial disease risk evaluation 
model for seven diseases and developed a novel platform 
by using NGS and bioinformatics approaches. Compared 
with the traditional bacterial culture method, our proposed 
platform can reduce experiment time. Besides, the 
proportion of probiotics and pathogens (including 
uncultivable pathogens) in the human body can be 
detected rapidly with the 16S rRNA database of probiotics 
and pathogens. Furthermore, the proposed platform provides 
further insight into the cause of disease based on 
the relation between probiotics, pathogens, and disease. For 
instance, the type of antibiotics can be adjusted if the 
pathogens responsible for a disease are identified in infected patients. 
In addition, the proposed platform allows researchers to 
determine whether the intake of probiotics impacts the 
human body [25-29]. In the future, this preliminary study 
will be extended to more bacterial disease 
markers. For more comprehensive applications, this work 
will also collect bacteria from other parts of the human body 
as control group data. A method for detecting how 
probiotics and pathogens inhabit humans can provide 
new insight into human health and could improve diagnosis 
and treatment methods. 

Methods 

Figure 2 illustrates the bioinformatics system flow of the 
proposed platform, which includes the NGS analysis pipeline. 
Figure 2 contains four parts: sequence quality 
filtering, construction of the bacteria sequence database, 
taxonomy assignment, and disease risk model evaluation. 
The components of the proposed platform are 
described in detail below. 

Sample collection 

In this study, stool samples were gathered from 98 Taiwanese 
volunteers. The samples were collected with a Sigma-transwab 
(Medical Wire) into a tube with Liquid Amies 
Transport Medium, and stored at 4°C until processing. 

DNA extraction 

In the case study, fresh faeces were obtained from participants. 
DNA was extracted directly from stool samples 
using a QIAamp DNA Stool Mini Kit (Qiagen). A swab 
was vortexed vigorously and incubated at room temperature 
for 1 min. The sample was then transferred to 
microcentrifuge tubes containing 560 µl Buffer ASL, 
vortexed, and incubated at 37°C for 30 min. The suspension 
was then incubated at 95°C for 15 min, 
vortexed, and centrifuged at 14,000 rpm for 1 min to 
pellet stool particles. Extraction was performed following 
the protocol of the QIAamp DNA Stool Mini Kit. The 
DNA was eluted with 50 µl Buffer AE and centrifuged at 
14,000 rpm for 1 min. The DNA extract was 
stored at -20°C until further analysis. Finally, DNA extraction 
was performed, depending on the sample collected. 

Figure 2 System flow of bioinformatics analysis in the proposed platform. The proposed platform comprises the analysis pipeline of 
NGS, construction of the probiotics and pathogens database, bacterial disease risk model evaluation and the application of individualized bacteria 
sequencing profile. Panels: 16S rRNA sequence data quality filtering; construction of bacteria sequence database (Greengenes database, SILVA rRNA database, 
99 pathogenic bacteria collected from literature); disease risk evaluation (control group collection, bacterial disease marker collection, 
disease risk model evaluation); detection of probiotics and pathogens. 

Library construction and sequencing for V4 region of 16S 
ribosomal DNA 

The PCR primers, F515 (5'-GTGCCAGCMGCCGCGGTAA-3') 
and R806 (5'-GGACTACHVGGGTWTCTAAT-3'), 
were designed to amplify the V4 domain of 
bacterial 16S ribosomal DNA as described previously 
[30]. PCR amplification was performed in a 50 µl 
reaction volume containing 25 µl 2X Taq Master Mix 
(Thermo Scientific), 0.2 µM of each forward and reverse 
primer, and 20 ng DNA template. The reaction conditions 
consisted of an initial 95°C for 5 min, followed by 
30 cycles of 95°C for 30 sec, 54°C for 1 min, and 72°C 
for 1 min, as well as a final extension at 72°C for 5 min. 
Next, amplified products were checked by 2% agarose 
gel electrophoresis and ethidium bromide staining. 
Amplicons were purified using the AMPure XP PCR 
Purification Kit (Agencourt), and quantified using the Qubit 
dsDNA HS Assay Kit (Qubit) on a Qubit 2.0 Fluorometer 
(Qubit), all according to the respective manufacturers' 
instructions. For V4 library preparation, Illumina adapters 
were attached to the amplicons using the Illumina 
TruSeq DNA Sample Preparation v2 Kit. Purified libraries 
were applied for cluster generation and sequencing 
on the MiSeq system. The raw sequence files are available 
for download at http://clinic.mbc.nctu.edu.tw/. 
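
Both primers contain IUPAC degenerate bases (M, H, V, W), so each primer stands for a small pool of concrete sequences. The following minimal sketch expands the two primers quoted above into the sequences they represent; the helper name `expand_primer` is illustrative, and the mapping table is the standard IUPAC nucleotide code.

```python
from itertools import product

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

F515 = "GTGCCAGCMGCCGCGGTAA"
R806 = "GGACTACHVGGGTWTCTAAT"


def expand_primer(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    choices = (IUPAC[base] for base in primer)
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for name, primer in (("F515", F515), ("R806", R806)):
        variants = expand_primer(primer)
        print(name, len(primer), "nt,", len(variants), "variants")
```

The number of variants is the product of the choices at each degenerate position, for example two for the single M in F515.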

16S rRNA (rDNA) sequence data quality filtering 

The raw fastq files obtained from the Illumina sequencing machine 
were quality-filtered using the FASTX-Toolkit (endnote a). 
The paired-end 150 bp reads were filtered with a 
minimum acceptable Phred quality score of 20, and 
at least 70% of the bases in a read had to reach this 
score. Reads were quality-trimmed from the tail, and 
sequences shorter than 100 nucleotides after trimming were 
omitted. Reads containing ambiguous characters were also discarded. 
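
The filtering rules above can be sketched in plain Python. This is not the FASTX-Toolkit itself: the function names, the Phred+33 quality encoding, and the order in which the rules are applied (tail trimming first, then the length, ambiguity, and 70% checks) are assumptions made for illustration.

```python
MIN_QUALITY = 20      # minimum acceptable Phred score
MIN_PERCENT = 70      # percent of bases that must reach MIN_QUALITY
MIN_LENGTH = 100      # reads shorter than this after trimming are dropped
PHRED_OFFSET = 33     # assumed Phred+33 encoding of the quality string


def phred_scores(quality):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]


def trim_tail(sequence, quality):
    """Remove trailing bases whose quality is below MIN_QUALITY."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < MIN_QUALITY:
        end -= 1
    return sequence[:end], quality[:end]


def passes_filter(sequence, quality):
    """Apply trimming, then the length, ambiguity, and quality-share checks."""
    sequence, quality = trim_tail(sequence, quality)
    if len(sequence) < MIN_LENGTH:
        return False
    if any(base not in "ACGT" for base in sequence.upper()):
        return False  # ambiguous characters such as N
    scores = phred_scores(quality)
    good = sum(score >= MIN_QUALITY for score in scores)
    return 100 * good >= MIN_PERCENT * len(scores)


if __name__ == "__main__":
    read = "ACGT" * 30
    print(passes_filter(read, "I" * 120))             # uniformly high quality
    print(passes_filter(read, "I" * 90 + "#" * 30))   # long low-quality tail
```

In practice each FASTQ record (four lines) would be parsed and passed through `passes_filter`, keeping the trimmed read when the function returns True.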

Construction of probiotics and pathogens database 

The lists of probiotics and pathogens were obtained from 
the literature or the claims of official departments. Additional 
file 1: Table S1 lists the species of probiotics, which were 
adapted from both the literature [7,9] and the claims of 
official departments, such as the Taiwan Food and Drug 
Administration [31] and Health Canada [32]. Ninety-nine bacterial 
pathogens were collected from the literature [25,26,33-42] and the 
Taiwan Food and Drug Administration [31] (Additional 
file 1: Table S2). 

The 16S rRNA sequences of probiotics and pathogens 
used for taxonomy mapping were retrieved from the 
NCBI nucleotide database, NCBI 16S microbial rRNA 
database, Greengenes database [43] and SILVA [44]. 



Chiu et al. Journal of Clinical Bioinformatics 2014, 4:1 
http://www.jclinbioinformatics.com/content/4/1/1 



Table 5 Disease-related biomarkers of seven diseases 

Disease             Marker                        Correlation  Lower bound  Upper bound  Case  Control  Pubmed ID
Constipation        Escherichia coli              -            2.86E-03     1.52E-01     35    35       20039451
                    Roseburia                     -            1.41E-03     4.61E-02     14    12       22315951
                    Lactobacillus                 -            6.10E-05     9.45E-03     14    12       22315951
                    Bifidobacterium               -            5.39E-05     1.73E-02     14    12       22315951
                    Enterobacteriaceae            +            1.00E-02     4.26E-01     14    12       22315951
                    Ruminococcus bromii           +            1.16E-05     4.98E-03     8     15       20014457
Obesity             Prevotella                    -            2.46E-03     5.36E-01     23    13       20876719
                    Bifidobacterium               -            5.39E-05     1.73E-02     33    30       19498350
                    Lachnospiraceae               -            3.11E-03     6.74E-02     3     3        19164560
                    Verrucomicrobiae              -            1.43E-05     1.78E-02     3     3        19164560
                    Akkermansia                   [?]          1.43E-05     1.78E-02     3     3        19164560
                    Faecalibacterium prausnitzii  +            7.70E-04     2.15E-02     15    13       19849869
                    Lactobacillus                 +            6.10E-05     9.45E-03     20    20       19774074
                    Coriobacteriaceae             +            3.26E-05     4.72E-03     3     3        19164560
                    Erysipelotrichaceae           +            1.35E-04     6.64E-03     3     3        19164560
Ulcerative colitis  Bacteroides uniformis         -            7.63E-04     5.44E-02     13    22       21073731
                    Bacteroides vulgatus          -            1.55E-03     4.21E-02     13    22       21073731
                    Parabacteroides distasonis    [?]          2.22E-05     1.68E-03     13    22       21073731
                    Faecalibacterium prausnitzii  -            7.70E-04     2.15E-02     13    27       19235886
                    Firmicutes                    -            9.18E-02     4.50E-01     13    27       19235886
                    Clostridium                   [?]          2.48E-03     6.03E-02     31    30       21253779
                    Clostridium leptum            -            9.65E-06     1.05E-03     13    27       19235886
                    Bifidobacterium               [?]          5.39E-05     1.73E-02     13    27       19235886
                    Bacteroides ovatus            [?]          2.04E-04     1.81E-02     13    22       21073731
                    Escherichia coli              +            2.86E-03     1.52E-01     9     9        16954244
Atopic dermatitis   Lactobacillus                 [?]          6.10E-05     9.45E-03     68    256      17604093
                    Bifidobacteriales             -            8.09E-05     1.84E-02     7     27       20626364
                    Bacteroides                   +            6.56E-02     6.37E-01     68    256      17604093
                    Clostridium perfringens       +            0.00E+00     1.06E-04     15    15       21963389
Colorectal cancer   Bacteroides uniformis         [?]          7.63E-04     5.44E-02     46    56       21850056
                    Roseburia                     -            1.41E-03     4.61E-02     46    56       21850056
                    Fusobacterium                 -            3.32E-05     2.64E-02     50    38       7574628
                    Eubacterium                   -            1.36E-03     7.92E-02     46    56       21850056
                    Coprococcus                   -            1.91E-05     2.89E-03     21    23       20740058
                    Collinsella aerofaciens       -            2.39E-05     2.09E-03     50    38       7574628
                    Alistipes                     -            4.07E-04     2.60E-02     46    56       21850056
                    Sutterellaceae                -            9.39E-04     4.85E-02     46    56       21850056
                    Escherichia                   +            3.05E-03     1.85E-01     46    56       21850056
                    Shigella                      +            1.51E-03     8.84E-02     46    56       21850056
                    Bacteroides fragilis          +            7.22E-06     1.92E-02     46    56       21850056
                    Porphyromonas                 +            0.00E+00     1.59E-05     46    56       21850056
                    Faecalibacterium prausnitzii  +            7.70E-04     2.15E-02     50    38       7574628
                    Ruminococcus albus            +            0.00E+00     4.95E-04     50    38       7574628
                    Streptococcus                 +            1.12E-04     6.83E-03     46    56       21850056

[?] marks an entry that is not legible in the source. 
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Table 5 Disease-related biomarkers of seven diseases (Continued) 



Irritable bowel syndrome 



Allergic rhinitis 



Blautia hansenii | + | 0.00E+00 | [illegible] | 50 | 38 | 7574628
Enterococcus | + | 0.00E+00 | 1.19E-04 | 46 | 56 | 21850056
[taxon illegible] | + | 0.00E+00 | [illegible] | 50 | 38 | 7574628
Blautia producta | + | 0.00E+00 | 1.10E-04 | 50 | 38 | 7574628
Ruminococcus gnavus | + | 1.04E-05 | [illegible] | 50 | 38 | 7574628
Eubacterium eligens | + | 7.09E-05 | 2.05E-02 | 50 | 38 | 7574628
Eubacterium rectale | + | [illegible] | [illegible] | 50 | 38 | 7574628
Bacteroides stercoris | + | [illegible] | 2.94E-02 | 50 | 38 | 7574628
Enterobacteriales | + | 1.00E-02 | 4.26E-01 | 10 | 10 | [illegible]
Erysipelotrichaceae | + | 1.35E-04 | 5.64E-03 | 50 | 38 | 7574628
Dorea | + | 5.67E-05 | 6.08E-03 | 21 | 23 | [illegible]
Bifidobacterium longum | + | 1.56E-05 | 3.60E-03 | 50 | 38 | 7574628
Faecalibacterium | + | 1.66E-03 | 6.79E-02 | 21 | 23 | [illegible]
Bacteroides uniformis |  | 7.63E-04 | 5.44E-02 | 11 | 22 | 21073731
Bacteroides vulgatus |  | 1.55E-03 | 4.21E-02 | 11 | 22 | 21073731
Parabacteroides distasonis |  | [illegible] | [illegible] | 11 | 22 | 21073731
Faecalibacterium prausnitzii |  | 7.70E-04 | 2.15E-02 | 23 | 23 | 22339879
Bacteroidetes |  | [illegible] | [illegible] | 62 | 46 | 21820992
Bifidobacterium |  | 5.39E-05 | 1.73E-02 | 62 | 46 | 21820992
Bacteroides ovatus |  | 2.04E-04 | [illegible] | 11 | 22 | 21073731
Faecalibacterium |  | 1.66E-03 | 6.79E-02 | 62 | 46 | 21820992
Escherichia coli | + | [illegible] | 1.52E-01 | 14 | [illegible] | [illegible]
Haemophilus | + | 1.02E-05 | 1.69E-03 | 22 | 22 | 21741921
Fusobacterium | + | 3.32E-05 | 2.64E-02 | 23 | 23 | 22339879
Gammaproteobacteria | + | [illegible] | [illegible] | 22 | 22 | 21741921
Ruminococcus | + | [illegible] | [illegible] | 62 | 46 | 21820992

Enterococcus | + | 0.00E+00 | 1.19E-04 | 23 | 23 | 22339879
Veillonella | + | 1.12E-05 | 7.82E-03 | 26 | 26 | 19903265
Lactobacillaceae | + | 6.10E-05 | 9.45E-03 | 23 | 23 | 22339879
Dorea | + | 5.67E-05 | 6.08E-03 | 62 | 46 | 21820992
Lactobacillus |  | 6.10E-05 | 9.45E-03 | 12 | 12 | 19714856
Bifidobacterium |  | 5.39E-05 | 1.73E-02 | 67 | 20 | 101
Bacteroides fragilis | + | 7.22E-06 | 1.92E-02 | 22 | 22 | 17893165
Faecalibacterium prausnitzii | + | 7.70E-04 | 2.15E-02 | 22 | 22 | 17893165



The associations between bacteria and diseases were mainly collected from case-control studies in which the quantities of bacteria were obtained from deep sequencing data. The proportions of 78 bacteria in the control group were applied as risk markers (constipation: 6, obesity: 9, IBS: 17, UC: 10, CC: 28, AD: 4, AR: 4) to predict the risk of seven diseases in this study.



Following sequence data collection, we assembled partial sequences that shared the same species classification and removed redundant sequences. Additionally, we removed unique sequences that were supported by only one study and that, at 3% similarity, shared the same species classification with another sequence.

Taxonomy mapping 

To generate taxonomy assignments, the proposed platform invoked a modified Smith-Waterman algorithm from miRExpress [45], which can compare pairs of sequences in parallel, to map reads to taxa. miRExpress was designed to identify the best similarity between sequencing reads and miRNA precursor sequences. In our model, it was modified to report multiple hits for 16S rRNA sequence mapping at a similarity threshold of 0.97. To reduce the storage space of the output, the SAM format [46] replaced the original miRExpress output format for storing alignment results. Furthermore, two output formats were designed. One records all mapped sequencing reads for each taxon. The other records the taxa to which each sequencing read could be assigned. Together, these two outputs provide the information needed to assign sequencing reads to a suitable taxon. miRExpress was originally designed for single-end sequencing data, so an additional program was added to process paired-end sequencing data. In this step, both ends of a read pair need to be assigned to the same taxon; if the two ends were mapped to different taxa, the pair was dropped. The 16S rRNA sequences of probiotics and pathogens from our database were stored in FASTA format. Following quality filtering, all paired-end sequences were aligned to the probiotics and pathogens database, with each whole read aligned from one end to the other. Reads with an identity lower than 97% were then discarded, a cutoff chosen according to previous research as a compromise between sequence differences caused by PCR and sequencing errors and taxonomic relatedness [27].
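The paired-end rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the miRExpress implementation: the function names (`identity`, `best_taxon`, `assign_pair`) are hypothetical, and taking the single best hit per read end is an assumption made here for simplicity, since the platform itself reports multiple hits.

```python
from typing import Dict, Optional

IDENTITY_THRESHOLD = 0.97  # similarity threshold stated in the text


def identity(aligned_read: str, aligned_ref: str) -> float:
    """Fraction of alignment columns in which read and reference agree.

    Both strings are assumed to be the two rows of one pairwise alignment,
    equal in length, with '-' marking gaps.
    """
    if len(aligned_read) != len(aligned_ref) or not aligned_read:
        raise ValueError("alignment rows must be non-empty and equal in length")
    matches = sum(a == b and a != "-" for a, b in zip(aligned_read, aligned_ref))
    return matches / len(aligned_read)


def best_taxon(hits: Dict[str, float]) -> Optional[str]:
    """Return the taxon with the highest identity at or above the threshold.

    Assumption: one best hit per read end is used for the assignment.
    """
    passing = {taxon: value for taxon, value in hits.items()
               if value >= IDENTITY_THRESHOLD}
    if not passing:
        return None
    return max(passing, key=lambda taxon: passing[taxon])


def assign_pair(hits_end1: Dict[str, float],
                hits_end2: Dict[str, float]) -> Optional[str]:
    """Keep a read pair only when both ends are assigned to the same taxon."""
    taxon1 = best_taxon(hits_end1)
    taxon2 = best_taxon(hits_end2)
    if taxon1 is None or taxon1 != taxon2:
        return None  # the pair is dropped
    return taxon1
```

For example, `assign_pair({"Dorea": 0.99}, {"Dorea": 0.98})` returns `"Dorea"`, whereas a pair whose ends map to different taxa, or whose identity falls below 0.97 on either end, returns `None`.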

The construction of Bacterial disease risk evaluation 
model (BDREM) 

To study the associations between bacteria and diseases, we collected related information from the literature. We focused on bacteria associated with seven diseases: constipation [28,47,48], obesity [29,49-52], irritable bowel syndrome (IBS) [28,53-58], ulcerative colitis (UC) [53,59-61], colon cancer (CC) [62-64], atopic dermatitis (AD) and allergic rhinitis (AR). For each disease, positive and negative correlation data were collected, and the individual risk of disease was evaluated.

The association data were mainly collected from case-control studies in which the quantities of bacteria were obtained from NGS data; a few well-known bacteria validated by multiple studies through culture experiments were also included. We further eliminated conflicting data for which different studies reported both positive and negative correlations between a bacterium and a disease.

Stool samples from 98 healthy Asian volunteers in Taiwan were gathered. Following deep sequencing and sequencing data processing, the proportions of 78 bacteria in the control group were applied as risk markers (constipation: 6, obesity: 9, IBS: 17, UC: 10, CC: 28, AD: 4, AR: 4) to predict the risk of seven diseases in this study (Table 5).

[Figure 3 image: flow chart of disease risk model evaluation. Steps shown: (1) the proportion data of subjects from NGS; (2) using the lower and upper proportion bounds of 9 markers from 98 control samples to define risk markers of subjects; (3) using a binomial test to perform an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment; (4) the association between bacteria and diseases from subjects. The obesity panel for samples B034 and B031 lists 9 markers each, a hypothesized probability of 0.07239, and P-values of 2.572e-03 and 0.1344, respectively.]

Figure 3 An example of evaluating the risk of obesity with the bacterial disease risk evaluation model. The model used the lower and upper proportion bounds of 9 markers from 98 control samples to define the risk markers of two samples (B034 and B031), followed by a binomial test.

The mathematical formulation of BDREM was developed in the following steps. Let A be an N x S matrix, where N is the number of markers selected in the prediction model of a given disease (for example, constipation) and S is the number of healthy subjects used in the seven prediction models. $T_i$ was defined as one of the two notches of the median for each row of A [65]. $T_i$ is the threshold that separates a normal proportion level of $A_{ij}$ from an abnormal one (failure versus success in one trial of a binomial distribution). The smaller notch was selected as $T_i$ when the marker was recorded as negatively associated with the disease, and a successful trial was identified when $A_{ij}$ was smaller than $T_i$. Conversely, the larger notch was selected when the association was positive, and a successful trial was identified when $A_{ij}$ was larger than $T_i$.



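$$
T_i =
\begin{cases}
\mathrm{Median}\{A_{i1}, A_{i2}, \ldots, A_{iS}\} + \dfrac{1.58 \times \mathrm{IQR}}{\sqrt{S}}, & \text{positive association} \\[2ex]
\mathrm{Median}\{A_{i1}, A_{i2}, \ldots, A_{iS}\} - \dfrac{1.58 \times \mathrm{IQR}}{\sqrt{S}}, & \text{negative association}
\end{cases}
$$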
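As an illustration of the notch threshold, the following Python sketch computes the threshold for one row of A. The function name `notch_threshold` is hypothetical, and the quartile convention (the "inclusive" method of `statistics.quantiles`) is an assumption, since the text does not state how the IQR was computed.

```python
import math
import statistics
from typing import Sequence


def notch_threshold(row: Sequence[float], positive_association: bool) -> float:
    """Return the median notch used as the threshold for one marker.

    The upper notch is returned for a positively associated marker and the
    lower notch for a negatively associated one.
    """
    n_subjects = len(row)
    # Assumption: quartiles follow the "inclusive" convention.
    q1, _, q3 = statistics.quantiles(row, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(n_subjects)
    median = statistics.median(row)
    if positive_association:
        return median + half_width
    return median - half_width
```

A different quartile convention would shift the notches slightly, so results should be compared only between runs that use the same convention.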



Let $P_j$ be the probability of successful trials in the $j$th column of A. $P_j$ represents the personal probability that an abnormal proportion level occurs.



$$
P_j = \frac{\#\,\text{successful trials in the } j\text{th column of } A}{N}
$$



Let $P_h$ be the mean of $P_j$. It represents how frequently, on average, an abnormal proportion level occurs across all $P_j$, and it is regarded as the hypothesized probability of success for each $P_j$.



$$
P_h = \frac{1}{S} \sum_{j=1}^{S} P_j
$$

Assume that the successful trials underlying $P_j$ follow a binomial distribution, and let $P_h$ be the hypothesized probability of success (0.05051 for constipation, 0.07239 for obesity, 0.06952 for IBS, 0.05227 for UC, 0.09280 for CC, 0.04924 for AD, 0.05114 for AR). A binomial test was applied to $P_j$ against $P_h$, and a significance level of alpha = 0.05 was chosen to judge whether a subject differs significantly from the others in A.
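The quantities defined above can be computed from a small matrix as in the sketch below. It is an illustration only: the function names and the toy values are hypothetical, and the per-marker thresholds are passed in as given numbers.

```python
from typing import List, Sequence


def success_matrix(A: Sequence[Sequence[float]],
                   thresholds: Sequence[float],
                   positive: Sequence[bool]) -> List[List[int]]:
    """Mark each entry 1 when it lies on the abnormal side of its threshold."""
    return [
        [int(value > t) if is_positive else int(value < t) for value in row]
        for row, t, is_positive in zip(A, thresholds, positive)
    ]


def personal_probabilities(successes: Sequence[Sequence[int]]) -> List[float]:
    """P_j: fraction of the N markers that are successes for subject j."""
    n_markers = len(successes)
    n_subjects = len(successes[0])
    return [
        sum(successes[i][j] for i in range(n_markers)) / n_markers
        for j in range(n_subjects)
    ]


def hypothesized_probability(p: Sequence[float]) -> float:
    """P_h: mean of P_j over all subjects."""
    return sum(p) / len(p)


# Toy data: N = 2 markers (rows), S = 3 subjects (columns).
A = [[0.1, 0.2, 0.9],
     [0.5, 0.05, 0.4]]
successes = success_matrix(A, thresholds=[0.8, 0.1], positive=[True, False])
p_j = personal_probabilities(successes)
p_h = hypothesized_probability(p_j)
```

In this toy case the first marker is abnormal only for the third subject and the second marker only for the second subject, so `p_j` is `[0.0, 0.5, 0.5]` and `p_h` is one third.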

Figure 3 illustrates an example of evaluating the obesity risk of samples B034 and B031. The model used the lower and upper proportion bounds of 9 markers from 98 control samples to define the risk markers of these two samples, followed by a binomial test. Four markers of B034 fall outside the lower or upper bound for obesity. The binomial test P-value of B034 is 2.572e-03, which is below alpha = 0.05 given the hypothesized probability of 0.07239, so this case is significantly more associated with the disease than expected by random chance. Two markers of B031 fall outside the lower bound for obesity. The P-value of B031 is 0.1344 > 0.05, so this case is no more associated with the disease than random chance. We can therefore infer that B034 has a higher risk of obesity.
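The two P-values in this example can be checked with a short calculation. The sketch below assumes an upper-tail (one-sided) exact binomial test, which is an assumption rather than something stated in the text; it is used here because it reproduces the reported values for 4 and 2 high-risk markers out of 9. The function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


P_H_OBESITY = 0.07239  # hypothesized probability for obesity, from the text
N_MARKERS = 9          # number of obesity markers, from the text

p_b034 = binomial_upper_tail(4, N_MARKERS, P_H_OBESITY)  # four high-risk markers
p_b031 = binomial_upper_tail(2, N_MARKERS, P_H_OBESITY)  # two high-risk markers

print(f"B034: {p_b034:.3e}", f"B031: {p_b031:.4f}")
# prints "B034: 2.572e-03 B031: 0.1344"
```

Only B034 falls below alpha = 0.05, in agreement with the conclusion drawn from Figure 3.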



Endnote 

a http://hannonlab.cshl.edu/fastx_toolkit/index.html. 
Additional file 



Additional file 1: The lists of probiotics and pathogens obtained from the literature or the claims of official departments. Table S1. The reference list of probiotics. Table S2. The reference list of pathogens.